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Abstract 

We study the feasibility of a PC-based parallel computer for medium to large scale lattice 
QCD simulations. The Eotvos Univ., Inst. Theor. Phys. cluster consists of 137 Intel P4-1.7GHz 
nodes with 512 MB RDRAM. The 32-bit, single precision sustained performance for dynamical 
QCD without communication is 1510 Mflops/node with Wilson and 970 Mflops/node with 
staggered fermions. This gives a total performance of 208 Gflops for Wilson and 133 Gflops for 
staggered QCD, respectively (for 64-bit applications the performance is approximately halved). 
The novel feature of our system is its communication architecture. In order to have a scalable, 
cost-effective machine we use Gigabit Ethernet cards for nearest-neighbor communications in a 
two-dimensional mesh. This type of communication is cost effective (only 30% of the hardware 
costs is spent on the communication). According to our benchmark measurements this type 
of communication results in around 40% communication time fraction for lattices upto 48 3 • 96 
in full QCD simulations. The price/sustained-performance ratio for full QCD is better than 
$l/Mflops for Wilson (and around $1.5/Mflops for staggered) quarks for practically any lattice 
size, which can fit in our parallel computer. The communication software is freely available 
upon request for non-profit organizations. 



1 Introduction 

The most powerful computers for lattice gauge theory are industrial supercomputers or special pur- 
pose parallel computers (see e.g. [1, 2, 3, 4]). Nevertheless, it is more and more accepted that even 
off-the-shelf processors (e.g. PC processors) can be used to build parallel computers for lattice QCD 
simulations [5, 6, 7, 9]. 
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There are obvious advantages of PC based systems. Single PC hardware usually has excellent 
price/performance ratios for both single and double precision applications. In most cases the op- 
erating system (Linux), compiler (gcc) and other software are free. Another advantage of using 
PC/Linux based systems is that lattice codes remain portable. Furthermore, due to their price they 
are available for a broader community working on lattice gauge theory. For recent review papers and 
benchmarks see [10, 11, 12]. 

Despite the obvious advantages it is not clear how good is the price/performance ratio in medium 
or large scale lattice QCD applications, for which a large number of processors are needed. Clearly, 
these multi-processor computer systems dissipate a non-negligible amount of heat. The components 
should be reliable in order to avoid frequent system crashes. Most importantly, the communication 
network should be fast and cheap enough in order to keep the good price/performance ratio of the 
single PCs. 

We present our experiences and benchmark results. We show that a parallel system e.g. based on 
(9(100) PCs can run for several weeks or more without system crashes. The costs for air-conditioning 
are quite small. Such a system is scalable, it uses nearest-neighbor communication through Gigabit 
Ethernet cards. The communication is fast enough (consuming 40% of the total time in typical 
applications) and cheap enough (30% of the total price is spent on the communication). The system 
can sustain 100-200 Gflops on today's medium to large lattices. This result gives a price/performance 
ratio better than $l/Mflops for 32-bit applications (twice as much for 64-bit applications). 

Already in 1999 a report was presented [7] on the PC-based parallel computer project at the 
Eotvos University, Budapest, Hungary. A machine was constructed with 32 PCs arranged in a 
three-dimensional 2x4x4 mesh. Each node had two special, purpose designed communication cards 
providing communication through flat cables to the six neighbors. We used the "multimedia exten- 
sion" instruction set of the processors to increase the performance. The bandwidth between nodes 
(16 Mbit/s) was sufficient for physical problems, which needed large bosonic systems 1 ; however, 
the communication was far too slow for lattice QCD. Here we report on a system which needs no 
hardware development and has two orders of magnitude larger communication bandwidth through 
Gigabit Ethernet cards. Our system is based on PCs, each with a P4-1.7GHz processor and 512 MB 
RDRAM memory. This system with its Gigabit Ethernet communication represents a good balance 
between CPU power and network bandwidth. According to our knowledge this is the first scalable 
parallel computer for lattice QCD with a price/sustained-performance ratio better than $l/Mflops. 

In Section 2 we briefly describe the hardware and discuss its cost. Section 3 deals with the 
software used by our system, particularly with the C library we developed for communication. In 
Section 4 we discuss physics issues and present benchmark results based on the MILC code in Section 
5. Section 6 contains our conclusions. 

2 Hardware 

The nodes of our system are almost complete PCs (namely PCs without video cards, floppy- and 
CD-drives) with four additional Gigabit Ethernet cards. Each node consists of an Intel KD850GB 
motherboard, Intel- P4-1.7G Hz processor, 512 MB RDRAM, 100 Mbit Ethernet card, 20.4 GB IDE 
HDD and four SMC9452 Gigabit Ethernet cards for the two dimensional communication. 

Altogether we have 137 PC nodes. One might use it as one cluster with 128 nodes (or two clusters 
with 64 nodes or four clusters with 32 nodes) for mass production 2 . Two smaller clusters with 4 
nodes are installed for development with or without communication. One additional, single machine 
is used for compilation, controlling the system and for smaller tests. 

1 SU(2)-Higgs model for the finite temperature clcctrowcak phase transition (for simulation techniques see e.g. [8]) 
2 We have 264 SMC9452 cards and our benchmark results were obtained on 64-, 32-, f6-, 8- and 4-node clusters 
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Figure 1: Illustration of a 64-node cluster. The PC nodes are arranged in a 8x8 two-dimensional 
torus. Each node is an almost complete PC with four additional Gigabit Ethernet cards. The four 
Gigabit cards are used to reach the four neighboring nodes through cross-twisted copper Cat5 UTP 
cables. As indicated, at the boundaries periodic boundary conditions are realized. 

After turning the machines on each node boots Linux from its own hard-disk. All nodes can 
be accessed through a 100 Mbit Ethernet network with switches. The single node (the "main" 
computer) controls the whole cluster. A simple job-management package has been written to copy 
the executable code and data to and from the nodes and to execute the programs. 

Each motherboard has five PCI slots (32bit/33MHz). One of them is used for the 100Mbit 
Ethernet card 3 . These cards are connected to six Ovislink FSH24TX+ 24 port 100 Mbit Ethernet 
switches. Controlling and copying files from and to the nodes is done through this network. The other 
PCI slots contain the Gigabit Ethernet cards. Each Gigabit Ethernet card is connected to a Gigabit 
card of a neighboring node by a cross-twisted copper Cat5 UTP cable. This allows us to perform 
nearest-neighbor communication. The number of the PCI slots on the motherboard determines the 
dimensionality of the resulting mesh. In our case there are five slots. Four slots are used for Gigabit 
communication in four directions, thus in two dimensions (see Figure 1). In principle, we could have 
used the remaining fifth PCI slot for a third dimension of the mesh, though in this third dimension 
the mesh would have only two nodes (due to the periodic boundary condition a single connection is 
enough in this special case). This arrangement results in a 2x8x8 mesh for 128 nodes. Similarly, 
other PCI slots can also be used for dimensions with extension of 2 nodes. This idea could result in 

3 note, that another version of this motherboard is equipped with the 100 Mbit Ethernet interface, thus all five PCI 
slots can be used for other purposes 
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meshes e.g. 2x2x2x8 or 2x2x8 for five or four PCI slots with Gigabit connections, respectively 4 . 
The best node-topology for a given lattice can be determined by optimizing the surface to volume 
ratio of the local sub-lattice. Changing the node topology needs reconnecting the Cat5 UTP cables 
of the system. This procedure is extremely easy, one person can do it in less than 10 minutes. Note, 
that presently even 32 or 64 nodes can accommodate fairly large lattices. The topology of e.g. a 
64- node system is 8x8 with periodic boundary conditions. Such a cluster is illustrated on Figure 1. 
Note, that the architecture of the machine is similar to that of the APE machines [1]. 

We started to use our cluster in October 2001. In this three months we never suffered from a 
system crash of a running PC due to hardware or OS problems. According to our experiences the 
system can run for several weeks without needing a reboot (after which we reboot anyhow). 

One of the most important advantage of PC-based parallel computers is their exceptional price 
(for price estimates see e.g. [13], though for 0(100) PCs one can easily reach better prices). In 
February, 2002 the price of one node including the 100 Mbit Ethernet switches is $502. The four 
Gigabit Ethernet cards cost additional $240 for each node (including the cables, which is less than 
$1 for each). The power consumption of one node is 140 W. Since the machines dissipate a non- 
negligible amount of heat, we had to install a cooling system. The costs of the cooling system, 
when distributed among the individual nodes, resulted in $13 for one node. Thus, the total node 
(PC+communication+cooling) price is $755. 

As we will discuss it in Section 4 the typical performance of one node with communication 
overhead is around 800-900 Mflops for QCD with Wilson (500-600 Mflops with staggered) quarks. 
Note, that whenever we speak about performance we mean sustained-performance. Comparing these 
performance results and the above prices we end up with a price/performance ratio better than 
$l/Mflops for QCD with Wilson (and around $1.5/Mflops with staggered) quarks. 

There is an important message about our architecture. The key element of a cost effective design 
is an appropriate balance between communication and the performance of the nodes. As it can 
be seen we spent more than twice as much on the bare PC ($515 including cooling) than for the 
fast Gigabit communication ($240). These numbers are in strong contrast with Myrinet based PC 
systems, for which the high price of the Myrinet card exceeds the price of such a PC by a factor 
of two [14]. Similarly, motherboards with 64bit/66MHz PCI buses are expensive, too. Since even 
for our Gigabit Ethernet based machine the communication overhead is only around 40% (and in 
principle could be halved) a faster communication can not speed up significantly the system. Clearly, 
these ratios result in a far worse price/performance relationship for Myrinet based systems than for 
our architecture. 

As we will see the time needed for the calculation is dominated by memory access. This can not 
be made better by having faster communication or processor. There are in principle two ways to 
improve the obtained price/performance ratio for PC-based systems. First of all, one can push the 
costs of such a system down. This is realized in our case by choosing relatively cheap PC components 
with the high memory bandwidth RDRAM and a rather cheap communication. Secondly, one might 
use much faster memory access for a comparable price (e.g. use even faster memory access on the 
motherboard or put the whole lattice to the cache memory). In this case the faster communication 
would pay off. Unfortunately, this second option can not be used cost effectively right now. 

3 Software 

The main operating system of the cluster is SuSE Linux 7.1, being installed on each node, separately. 
As it was mentioned earlier, the machines are connected via Gigabit Ethernet cards, 4 cards per 

4 note, that having 2 nodes in one dimension does not reduces the communication bandwidth, which is essentially 
fixed by the PCI bus speed 
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machine building up a 2 dimensional periodic matrix with 4 neighbors for each node (with computers 
on the side being connected to the one on the opposite side). The job management is done by the 
"main" computer through an additional 100 Mbit Ethernet network (see e.g. [7]). This separate PC 
is where the main part of the program development and compilation is done and only the binary 
code is copied to the nodes. Communication development is done on a separate small 4-PC cluster. 

Nodes have special hostnames indicating their location in the matrix. For the 128-node cluster 
the names are p00,p01 . . . p07,pl0,pll . . . pf7. The two hexadecimal digits after the leading p 
represent the two coordinates of the node within the 16x8 matrix. 

In order to take advantage of the Gigabit communication from applications (e.g. C, C++ or 
Fortran code), a simple C library was written using the standard Linux network interface. Currently, 
we are using a standard socket based communication with Transmission Control Protocol (TCP) 
widely used in network applications. Clearly, TCP is much more advanced than our needs for a 
nearest-neighbor communication. This results in a somewhat large communication overhead. As we 
discuss it later, the typical measured communication bandwidth in QCD applications is around 400 
Mbit/s, which should be compared with the maximal theoretical bandwidth of 1000 Mbit/s and with 
the 800 Mbit / sec measured bandwidth for large packages. In order to eliminate the overhead due to 
TCP/IP we are working on a new, more efficient driver. 

During the startup period of the connection odd nodes act as clients, while even nodes are the 
servers (for the node with hostname pxy the parity is defined by the parity of x+y). After the network 
sockets (one two-way socket per link) are established they are kept open till the end of the program. 
During the execution of the program the previous "clients" and "servers" act symmetrically, both 
sending and receiving information to or from the neighboring nodes. 

The functions of the library handle the data exchange between the neighbors. They can be 
summarized as follows (the prefix "gb" refers to Gigabit). 

int gb_open(int *nodes_x, int *nodes_y) 

Opens four two-way network sockets to the neighboring nodes. On each node hostnames dirO 
(right), dirl (up), dir2 (down) and dirS (left) are defined to the IP address of the appropriate 
neighboring Ethernet card in the /etc/hosts file. First, the number of available machines is checked 
in the two directions and the variables nodesjx (left-right) and nodesjy (up-down) are set accordingly. 
On success gb-open returns NULL, otherwise -1. 

int gb_close() 

Closes all four network sockets. No further communication is possible after this call. On success 
gb_close returns NULL, otherwise -1. 

int gb_send(int dir, char *buffer, int size) 
int gb_recv(int dir, char *buffer, int size) 

Sends/receives data to/from the given direction. Directions are (right), 1 (up), 2 (down) and 
3 (left). Data are provided through a pointer buffer with length size in bytes. In case of receive 
operation size is the length of the buffer provided. On success the number of transferred bytes is 
returned, otherwise the return value is -1. 

int gb_send_recv(int dir, char *send_buf f er , char *recv_buffer int size) 

A combination of send and receive. Sends data to the given direction and receives from the 
opposite direction. sendJbuffer and recv_buffer are pointers to the data to be sent/received. On 
success the total number of transferred bytes is returned (2 x size), otherwise the return value is -1. 
This function expects exactly size bytes of data to be received and blocks other operation until the 
whole amount arrives. 
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int gb_collect (int dir, char *buffer, char *result, int size) 

Collects data from all the nodes in the direction dir to the array result. The buffer size should be 
at least the number of nodes in the given direction multiplied by size. The data to be transmitted 
from the current node is placed in buffer. It will be copied to the first size bytes in result on that 
node. The data from other nodes will be packed after each other and ordered in result by the distance 
of the originating node in the given direction dir. On success the total number of transmitted bytes 
is returned. 

Synchronization can be an important issue for parallel computing. For the presented system it is 
solved by the architecture itself. Each node uses its own clock and waits until the neighbors provide 
the necessary data to proceed in the program. Since the Gigabit Ethernet cards have their own 
memory there is no need to synchronize the "send" and "receive" instructions. 

4 Computational requirements for lattice problems 

Large scale lattice QCD simulations are focused on a few topics. They might be divided into two 
computationally different classes. 

a. One of them is full QCD. In this case the most important feature of a parallel computer is its 
maximal CPU performance in Gflops since most of the time is spent on the updating process. The 
computational requirements for new theoretical developments are similar to those of full QCD. 

b. The other class is heavy quark phenomenology (quenched or unquenched). In this case not 
only the CPU performance, but also the memory requirements for the measurements play a very 
important role. 5 

First we discuss full QCD. Only a small fraction (12-18% for large volumes with Wilson fermions) 
of our 137-node system's total nominal performance -0.93 Tflops for single precision and 0.47 Tflops 
for double precision calculations- can be obtained as a sustained value for full QCD simulations. 
Before we show the detailed benchmark results it is instructive to discuss some features of the cal- 
culations. The basic reason for the relatively small sustained-performance/peak-performance ratio 
is the memory bandwidth. If the pure computational performance is P, the memory bandwidth is 
B and the number of accessed bytes per operations in the code is F then the maximal sustained 
performance we can reach (assuming that memory access and computation is not performed simul- 
taneously) is P s = 1/(1/P + F/B). The theoretical computational performance is P = 6.8 Gflop/s 
since using the P4 processor's sse instruction set, four single precision operations can be performed 
in one clock cycle. The maximum bandwidth of the RAMBUS architecture is B = 3.2 GB/s. The 
value of F is 1.27 bytes/flop for Wilson and 1.81 bytes/flop for staggered quarks (cf. [11]). This 
gives an upper limit for the sustained performances, ^1800 Mflops for Wilson and ^1400 Mflops for 
staggered quarks. Note, that in both cases P s is dominated by the term F/B (essentially determined 
by the memory bandwidth) and not by 1/P (fixed by the theoretical computational performance). 
With careful cache- management one can prefetch the next required data concurrently with calcula- 
tion. By this technique one increases the maximal performance to P s « B/F. This estimate further 
emphasizes the importance of the memory bandwidth. 

Now we discuss the question, how good is the presented PC based parallel computer for heavy 
quark phenomenology. For this sort of physics problem a important computer issue is the memory 
requirement for the measurement of a heavy-light form-factor. At present we have 512 Mbytes/node. 
This gives for the 128 nodes 64 Gbyte, which is the memory of the possibly largest APEmille system 
of 8x8x32 [2]). The memory of the presented system can be easily extended. One might have 1.5 

5 We carry out full QCD simulations (staggered quarks at finite temperature and/or chemical potential [17]), thus 
our presented computer with its memory is somewhat less appropriate for heavy quark phenomenology; however, its 
memory can be easily extended. 
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Figure 2: Full QCD performance results on a 64- node cluster. We indicate the node performance (left 
vertical axis) and the total performance of the 64-node cluster as a function of the spatial extension 
L s . The lattice sizes are chosen to be L 3 x 2L S . Squares indicate Wilson, triangles staggered dynamical 
QCD simulation performances, respectively. The dotted line represents the $l/Mflops value. Lattices 
of size 16 3 x32 or smaller fits in the memory of one node. For these lattice sizes single nodes were 
used without communication. For larger lattices communication was switched on. This resulted in 
an « 30% reduction of the sustained performances. The inserted scale shows the (effective) machine 
size as the number of communicating nodes on which the lattice is distributed. 



Gbyte/node, which increases the costs by approximately 35%. This gives 200 Gbyte memory for 
a 128-node cluster. If only two propagators at the time are kept in memory, while the others are 
stored and reloaded from disks, memory requirements reduce sharply (see e.g. the analysis of the 
apeNEXT project [2]). Note, that this option needs a high bandwidth to and from the disks. Our 
measured total disk input/output bandwidth for a 128-node cluster is 1.2 Gbytes/s, thus even very 
large propagators can be loaded or stored in less than a minute. Due to the usage of the local disks 
this bandwidth scales with the number of nodes and the required time for input/output remains the 
same. Combining the above features it is easy to see that with extended memory fairly large lattices, 
e.g. of the order of 64 3 ■ 96 can be studied for heavy quark phenomenology on a 128-node cluster. 
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5 Performance 



In this section we discuss the benchmark results for the full QCD case. Figure 2 gives a summary 
of these runs. Single precision is used for local variables (gauge links and vectors, see [15]) and dot 
products are accumulated in double precision. Using double precision gauge links and vectors reduces 
the performance by approximately 50%. The reason for this reduction is twofold. On the one hand, 
both the memory access, and the communication between the nodes needs twice as much time; on 
the other hand, sse2-performance (double precision) is just the half of the sse-performance (single 
precision). In the simulations the R- algorithm is used [16]. Our results correspond to the full code 
performances (not only to the conjugate gradient iteration) 6 . 

For benchmarking we started from the MILC Collaboration's public code [15] . In order to increase 
the performance of the code we modified the original MILC code by three different techniques. 

First of all, we used the "multimedia extension" instruction set of the Intel-P4 processors (note, 
that the same sse instruction set is now available on Athlon XP processors, too). As we pointed out it 
in 1999 [7] this capability can accelerate the processor by a large factor (in Ref. [7] we used AMD-K6 
processors, which performs 4 operations for each clock cycle by using its multimedia extension). The 
multimedia extension of the Intel-P4 processor works as a vector unit with 128 bits wide registers, 
which can be used for arithmetic operations with 2 double precision or 4 single precision numbers. 
There is a simple way to access these registers, work with them and obtain very good performances 
(see M. Liischer's excellent review [12]). We rewrote almost the whole conjugate gradient part 
of the program to assembly (including also loops over the lattice and not only elementary matrix 
operations). No preconditioning was used. The 577(3) matrices and vectors were stored in a format 
that fits well to the sse architecture (see the appendix for details). This way we obtained a speedup 
factor of ~2 in the performances (similar benchmark results were obtained by e.g. Ref. [18]). 

Secondly, an important speedup was obtained by changing the data structure of the original 
MILC code suggested by S. Gottlieb [19]. The original code is based on the "site major" concept. 
This structure contains all the physical variables of a given site and the lattice is an array of these 
structures. Instead of this concept one should use "field major" variables. The set of a given type of 
variable (e.g. gauge links) of the different sites are collected and stored sequentially. In this case the 
gauge fields and the vectors are much better localized in the memory. If a cache line contains data 
not needed for the current site it is most probably some data of a neighboring site, which would be 
needed soon. Using the original structure this additional data of the next bytes of the "site major" 
structure is another physical variable, which is most likely not needed by the next calculational 
steps. Thus, "site major" structure effectively spoils the memory bandwidth, which is one of the 
most important factors of the performance. Similarly to Ref. [19] by changing the "site major" code 
to a "field major" one we observed a speedup factor of ~2 in the performances. 

The third ingredient of our improvement was a better cache management. The number of suc- 
cessful cache-hits was enhanced by extensive use of the "prefetch" instruction. This technique further 
increased the sustained performance of the code. 

One of the most obvious features of Figure 2 is the sharp drop of the performance when one 
turns on the communication. The most economic solution is to turn on the communication only if 
the memory is insufficient for the single-node mode. Smaller lattices can be also studied by using 
the communication (for instance for thermalization or parameter tuning); however, in these cases 
the communication overhead increases somewhat. E.g. having twice as large surface/volume ratio 
than the optimal one on a fixed number of communicating nodes we observed an additional 5-10% 
reduction of the sustained performance. 

6 For the staggered case the full code has been tested using communication. For the Wilson case the computationally 
most important (Dw + mo)ip is tested with communication. The full Wilson performance is estimated based on 
(Dw + mo)t/i, the same for staggered quarks ((D s + mo)ip) and the total staggered performances. 
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Figure 3: Sustained staggered performance of a 4- node system as a function of the local lattice 
volume V=L 3 x L s /2 (lower panel). The node topology is 2x2, thus the total lattice volume is 
L 3 x 2L S . The two almost parallel lines represent the results obtained by a myrinet system (upper 
line; these tests were perfomed on the myrinet cluster of the DESY Theory group, Hamburg) and 
by our gigabit communication (lower line). The one- node performance is also shown by the almost 
horizontal line with stars. Clearly, communication is of no use if the one-node performance is better 
than the 4-node one. Thus, the useful region is given by local lattices 8 3 x 4 or larger. The relative 
difference between the performances of the myrinet and gigabit systems is shown by the upper panel. 
As it can be seen in the useful region the relative difference is always less than 12%. 



It is instructive to study the question, how the communication overhead can be reduced. We 
needed more than 1 node when lattices of 18 3 -36 or larger were studied. With a somewhat better 
memory allocation lattices of 20 3 -40 can still fit in 1 node. In this case the volume/surface ratio of 
the local sub-lattice gets better resulting in a better communication overhead (e.g. in this case one 
would obtain 30% instead of 35%, which is slightly better). The typical measured communication 
bandwidth in QCD applications was around 400 Mbit/s. This is far less than the maximal theoretical 
bandwidth of 1000 Mbit/s. Since we are using the most straightforward communication based on 
TCP/IP it is feasible that a more effective driver can reach a bandwidth around 800 Mbit/sec. In 
this case the communication overhead would be around 20%. 

There is a simple theoretical consideration 7 to describe the performance of our architecture as 
P(V,A) = Pq(V)/(1 + c • A/V), where Pq is the performance without communications for a given 
local lattice volume V on each node. A is the number of "surface" lattice sites (per node) which 

we thank our referee for this transparent interpretation 
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participate in the communication. If each node holds L x ■ L y lattice sites in the communication plane 
one has A/V = 2(L X + L y ) / (L x ■ L y ), which becomes A/V = 4/ \J~N~ P I L s if the ■ 2L S lattice can be 
distributed in an optimal way (L x = L y ) over the 2-dimensional array of N p processors. Considering 
only a limited range of local lattice volumes the variation of Pq(V) due to cache effects is negligible. 
From the points for L s =12,...,16 in Figure 2 one can read off a value P « 1.63 Gflops. The above 
formula for P(V, A) reproduces all measured values shown in Figure 2 for L s >12 with an accuracy 
of a few percent using c=1.55. 

Note, that the "simulation scale" of a given physics problem depends on the lattice size used plus 
the CPU cost to obtain a certain result. The communication overhead of the present architecture is 
relatively large, i.e. even in the optimal case (the local lattice is large, occupying the whole memory) 
it is 30-40%. Taking smaller local lattices the sustained performance further drops. According 
to Figure 3 taking local volumes, which are 80 times smaller than the maximal ones results in an 
additional 50% drop in the sustained performance. When one uses the system as a scalable one 
and simulates relatively small local lattices this additional 50% reduction should be also taken into 
account. 

6 Conclusion 

We have presented a description of the status of our PC-based parallel computer project for lattice 
QCD. Nearest-neighbor communication is implemented in a two-dimensional mesh. Communication 
goes through Gigabit Ethernet cards. Some details of the communication software have been given. 
Presently, we work on a better driver to increase the communication bandwidth (in this way we 
expect to half the communication overhead). As it is usual for many PC clusters for our system it is 
also advisable to put sub-lattices on a relatively small number of nodes. In this case one simulates 
several independent smaller lattices in parallel. Using this technique one reaches an optimum for the 
surface to volume ratio and minimizes the time fraction for communication. For larger lattices the 
whole system should be used. 

The saturated dynamical QCD sustained performance of a 128 node system with communication 
is around 70 Gflops for staggered quarks and 110 Gflops for Wilson quarks (these numbers can be 
seen as a sort of average performances for typical applications using subsystems with different node 
numbers and relatively large local lattices; see Figures 2,3). 

Finally, we would like to emphasize that the system presented here 

1. is a scalable parallel computer with nearest-neighbor communication, 

2. can be assembled using standard PC components, 

3. takes advantage of the free software environment, 

4. has a very good price/sustained-performance ratio for full lattice QCD (with Wilson quarks 
less than $l/Mflops, for staggered quarks $1.5/Mflops), 

5. due to its large memory and its large total disk bandwidth it can be used for heavy quark 
phenomenology. 
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A Example source code 



The following source code demonstrates the usage of sse instructions and communication. 

* * 

* File: example. c * 

* Get SU(3) vectors from neighbors, multiply them by the * 

* local links and add the result * 

* * 



#include<stdio .h> 
#include<stdlib . h> 
#include"gb_lib . h" 

#define DIR.RIGHT 
#define DIR.UP 1 
#define DIR.DOWN 2 
#define DIR.LEFT 3 



/* su3 vector conveniently aligned for sse operations a_r[3] and a_i[3] 

are used only to maintain alignment */ 
typedef struct{ 

float a_r[4] ; 

float a_i[4] ; 
} sse_su3_vector attribute ((aligned (16))); 

/* su3 matrix */ 
typedef struct{ 

sse_su3_vector e [3] ; 
} sse_su3_matrix attribute ((aligned (16))); 

void sse_mult_su3_mat_vec(sse_su3_matrix *u, sse_su3_vector *src, 

sse_su3_vector *dest) ; 
void sse_add_su3_vec (sse_su3_vector *vl, sse_su3_vector *v2, 

sse_su3_vector *dest) ; 

void init_links(sse_su3_matrix *link){ 

/* Link initialization code may come here */ 

} 



void init_vec(sse_su3_vector *link){ 

/* Vector initialization code may come here */ 

} 



int main(int argc, char *argv[]){ 



static sse_su3_vector vect attribute ((aligned (16))); 

static sse_su3_vector tempi attribute ((aligned (16))); 
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static sse_su3_vector temp2 
static sse_su3_vector temp3 
static sse_su3_vector temp4 
static sse_su3_vector res 
static sse_su3_matrix link [2] 

int nodes_x,nodes_y ; 



attribute ((aligned (16)) 

attribute ((aligned (16)) 

attribute ((aligned (16)) 

attribute ((aligned (16)) 

attribute ((aligned (16)) 



/* Open communication channels */ 
if (gb_open(&nodes_x,&nodes_y) !=0) { 

printf ("Unable to open Gigabit communication channels\n") ; 

exit(-l) ; 

} 

printf ("Machine topology: %d x %d nodes\n" ,nodes_x,nodes_y) ; 
/* Initialize links and vector */ 
init_links(link) ; 
init_vec (fevect) ; 



/* Get vectors from neighbors */ 

gb_send_recv(DIR_LEFT, (char *)&vect , (char *)&templ, 

sizeof (sse_su3_vector) ) ; 
gb_send_recv(DIR_DOWN, (char *)&vect , (char *)&temp2, 

sizeof (sse_su3_vector) ) ; 

/* Perform multiplication with links */ 
sse_mult_su3_mat_vec(&(link[0] ) ,&templ ,&temp3) ; 
sse_mult_su3_mat_vec(&(link[l] ) ,&temp2,&temp4) ; 
sse_add_su3_vec (&temp3 , &temp4 , &res) ; 
gb_close() ; 
return (0) ; 



* 

File su3_sse.asm * 
SU(3) matrix-vector multiplication and vector addition * 
Compile with nasm v0.98 * 

* 



BITS 32 

GLOBAL sse_mult_su3_mat_vec 
GLOBAL sse_add_su3_vec 



REAL equ 
I MAG equ 16 

CO equ 
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CI equ 32 

C2 equ 64 



SU(3) matrix-vector multiplication 
C prototype: 

void sse_mult_su3_mat_vec(sse_su3_matrix *u, sse_su3_vector *src, 

sse_su3_vector *dest) ; 

dest=u*src 



sse_mult_su3_mat_vec : 

mov eax, [esp+04h] 
mov ecx, [esp+08h] 
mov edx, [esp+Och] 
prefetchtO [eax] 

movaps xmm6 , [ecx+REAL] ; first element of the vector 

movaps xmm7, [ecx+IMAG] 

shufps xmm6 ,xmm6 , 00000000b ; real part in xmm6 
shufps xmm7 , xmm7 , 00000000b ; imaginary part in xmm7 
movaps xmm4, [eax+CO+REAL] ; first column of the matrix 
movaps xmm5, [eax+CO+IMAG] ; to (xmm4,xmm5) 
movaps xmm0,xmm4 ; complex multiplication 

mulps xmm0,xmm6 
movaps xmm2,xmm7 
mulps xmm2,xmm5 
subps xmm0,xmm2 
movaps xmml,xmm4 
mulps xmml,xmm7 
mulps xmm5,xmm6 
addps xmml,xmm5 
movaps xmm6 , [ecx+REAL] 
movaps xmm7, [ecx+IMAG] 
shufps xmm6,xmm6, 01010101b 
shufps xmm7,xmm7, 01010101b 
movaps xmm4, [eax+Cl+REAL] 
movaps xmm5, [eax+Cl+IMAG] 
movaps xmm2,xmm4 
mulps xmm2,xmm6 
addps xmm0,xmm2 
movaps xmm2,xmm7 
mulps xmm2,xmm5 
subps xmm0,xmm2 
mulps xmm4,xmm7 
addps xmml , xmm4 
mulps xmm5,xmm6 
addps xmml,xmm5 
movaps xmm6 , [ecx+REAL] 
movaps xmm7, [ecx+IMAG] 
shufps xmm6,xmm6, 10101010b 



; real part goes to xmmO 



; imaginary part goes to xmml 
; second element of the vector 



; second column of the matrix 



; real part added to xmmO 



; imaginary part added to xmml 
; third element of the vector 
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shufps xmm7,xmm7, 10101010b 

movaps xmm4, [eax+C2+REAL] 

movaps xmm5, [eax+C2+IMAG] 

movaps xmm2,xmm4 

mulps xmm2,xmm6 

addps xmm0,xmm2 

movaps xmm2,xmm7 

mulps xmm2,xmm5 

subps xmm0,xmm2 

mulps xmm4,xmm7 

addps xmml , xmm4 

mulps xmm5,xmm6 

addps xmml , xmm5 

movaps [edx+REAL] ,xmm0 

movaps [edx+IMAG] ,xmml 

ret 



third column of the matrix 



real part added to xmmO 



imaginary part added to xmml 
store the result to dest 



SU(3) vector addition 
C prototype: 

void sse_add_su3_vec(sse_su3_vector *vl, sse_su3_vector *v2, 

sse_su3_vector *dest) ; 

dest=vl+v2 



sse_add_su3_vec : 

mov eax, [esp+04h] 
mov ecx, [esp+08h] 
mov edx, [esp+Och] 

movaps xmmO, [eax+REAL] ; real parts 

movaps xmml , [eax+IMAG] ; imaginary parts 

addps xmmO, [ecx+REAL] 

addps xmml, [ecx+IMAG] 

movaps [edx+REAL] ,xmm0 

movaps [edx+IMAG] ,xmml 

ret 
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